This lesson is intended to serve as a reference for
early tidyverse users. Instead of focusing on what’s under the
hood (i.e., why functions work the way that they do), the focus of
this document is to help users determine which functions to use in
different scenarios.
My hope is that this document will help limit the time that users have to ask “which function do I use for that?” As such, the material below is arranged by task.
Note: For easier navigation, I recommend opening this document in your browser window rather than the R Studio viewer pane.
Below is some handy reference material that you can refer to throughout this lesson. Please click on the menu item to expand it.
Throughout this class, I will refer to the panes (sections) of
the R Studio window. This graphic should help you remember them:
The Keyboard shortcuts we will use in this lesson are listed
below for Windows and Mac operating systems.
I strongly suggest using a keyboard shortcut whenever possible. It can be hard to do at first, using your mouse to click a button or typing a few characters is initially way easier than remembering the keyboard shortcut. If you force yourself to do it though, you won’t regret it – it saves a lot of time in the end.
Throughout this lesson, I will use terminology (i.e., jargon)
that may be new to you. This menu item can be clicked at any point if
you an encounter a term that you do not know or understand. Be sure to
visit this menu for each lesson, because new items will be added (Never
removed) as the course content expands.
Throughout this lesson and others there are several bash
commands you will have to use in the Terminal tab of your
Console pane. These include:
I strongly recommend downloading and printing the various
cheatsheets that have been built by the team at RStudio. They’re super
helpful!
Environments, data structures, functions, subsetting, and more are covered in this handy cheatsheet written by Arianne Colton and Sean Chen. Updated Feb 16. Download
The Data Import cheatsheet reminds you how to read in flat files with http://readr.tidyverse.org/, work with the results as tibbles, and reshape messy data with tidyr. Use tidyr to reshape your tables into tidy data, the data format that works the most seamlessly with R and the tidyverse. Updated January 17. Download
tidyr provides you with tools to transform untidy data into tidy data in R. This cheatsheet will guide you through the grammar, reminding you how to join, filter, arrange, mutate, summarise, group, and join data frames and tibbles. (Previous version) Updated January 17. Download
dplyr provides a grammar for manipulating tables in R. This cheatsheet will guide you through the grammar, giving you a quick reference on how to select, filter, arrange, mutate, summarize, group, and join data frames and tibbles. (Previous version) Updated January 17. Download
The ggplot2 package lets you make beautiful and customizable plots of your data. It implements the grammar of graphics, an easy to use system for building plots. See ggplot2.tidyverse.org for more. Updated November 16. Download
The RStudio IDE is the most popular integrated development environment for R. Do you want to write, run, and debug your own R code? Work collaboratively on R projects with version control? Build packages or create documents and apps? No matter what you do with R, the RStudio IDE can help you do it faster. This cheatsheet will guide you through the most useful features of the IDE, as well as the long list of keyboard shortcuts built into the RStudio IDE. Updated January 16. Download
Open R Studio and a new script file. Remember that it’s best practice to start with a clean R Studio session!
Add a new code section and call it “setup”
After a space between your section break, include and run the following:
library(lubridate)
library(tidyverse)
read_rds('data/raw/district_birds.rds') %>%
list2env(.GlobalEnv)
The select() family of functions is used to subset data
frame columns.
Let’s take a look at the full captures data frame
(Note: You may need to modify the width of your viewer pane to see
all of the columns at once):
captures
## # A tibble: 6,835 × 12
## capture_id visit_id band_…¹ color…² spp sex age wing tl mass bp_cp
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 9kIB37f b1BQ67f… 2261-3… B_WX SOSP M AHY 63 NA 23.9 <NA>
## 2 9XKm7o1 b1BQ67f… 1991-4… M_XP GRCA M AHY 90 NA 37.4 <NA>
## 3 6J74507 b1BQ67f… 2261-3… XYG_ SOSP M AHY 67 NA 25.9 <NA>
## 4 7E5T42L b1BQ67f… 1991-4… XOM_ NOCA M AHY 93 NA 42.2 <NA>
## 5 0551Y4q 76569Hj… 2411-3… X_PK GRCA M AHY 85 NA 33 <NA>
## 6 FJ72332 76569Hj… 2411-3… ORX_ NOCA F AHY 91 NA 45 <NA>
## 7 BD1A0kZ 76569Hj… 2411-3… BXP_ GRCA M AHY 87 NA 45 <NA>
## 8 136wur8 76569Hj… 2341-2… X_YK CARW U HY NA NA 22 <NA>
## 9 U8A8454 76569Hj… 1202-9… MXO_ AMRO F AHY 117 NA 91 <NA>
## 10 Q15TLAw 76569Hj… 1202-9… WKX_ AMRO M AHY 139 NA 75 <NA>
## # … with 6,825 more rows, 1 more variable: fat <chr>, and abbreviated variable
## # names ¹band_number, ²color_combo
Single column: To select a single column, we can provide a one-value vector with the name of the column (quotes are not necessary!):
captures %>%
select(band_number)
## # A tibble: 6,835 × 1
## band_number
## <chr>
## 1 2261-38348
## 2 1991-46167
## 3 2261-38347
## 4 1991-46166
## 5 2411-37708
## 6 2411-37710
## 7 2411-37709
## 8 2341-29331
## 9 1202-99076
## 10 1202-99075
## # … with 6,825 more rows
Adjacent columns: We can select a vector of adjacent columns by adding a colon between the first and last column of interest:
captures %>%
select(band_number:age)
## # A tibble: 6,835 × 5
## band_number color_combo spp sex age
## <chr> <chr> <chr> <chr> <chr>
## 1 2261-38348 B_WX SOSP M AHY
## 2 1991-46167 M_XP GRCA M AHY
## 3 2261-38347 XYG_ SOSP M AHY
## 4 1991-46166 XOM_ NOCA M AHY
## 5 2411-37708 X_PK GRCA M AHY
## 6 2411-37710 ORX_ NOCA F AHY
## 7 2411-37709 BXP_ GRCA M AHY
## 8 2341-29331 X_YK CARW U HY
## 9 1202-99076 MXO_ AMRO F AHY
## 10 1202-99075 WKX_ AMRO M AHY
## # … with 6,825 more rows
Non-adjacent columns: We can select a vector of
non-adjacent columns using the combine function c to
specify our vector of column names:
captures %>%
select(c(band_number, age))
## # A tibble: 6,835 × 2
## band_number age
## <chr> <chr>
## 1 2261-38348 AHY
## 2 1991-46167 AHY
## 3 2261-38347 AHY
## 4 1991-46166 AHY
## 5 2411-37708 AHY
## 6 2411-37710 AHY
## 7 2411-37709 AHY
## 8 2341-29331 HY
## 9 1202-99076 AHY
## 10 1202-99075 AHY
## # … with 6,825 more rows
Sets of columns: We can select non-adjacent and adjacent columns in a single select operation:
captures %>%
select(
c(capture_id,
band_number:age))
## # A tibble: 6,835 × 6
## capture_id band_number color_combo spp sex age
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9kIB37f 2261-38348 B_WX SOSP M AHY
## 2 9XKm7o1 1991-46167 M_XP GRCA M AHY
## 3 6J74507 2261-38347 XYG_ SOSP M AHY
## 4 7E5T42L 1991-46166 XOM_ NOCA M AHY
## 5 0551Y4q 2411-37708 X_PK GRCA M AHY
## 6 FJ72332 2411-37710 ORX_ NOCA F AHY
## 7 BD1A0kZ 2411-37709 BXP_ GRCA M AHY
## 8 136wur8 2341-29331 X_YK CARW U HY
## 9 U8A8454 1202-99076 MXO_ AMRO F AHY
## 10 Q15TLAw 1202-99075 WKX_ AMRO M AHY
## # … with 6,825 more rows
… or multiple sets of adjacent columns:
captures %>%
select(
c(capture_id,
band_number:color_combo,
wing:mass))
## # A tibble: 6,835 × 6
## capture_id band_number color_combo wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 9kIB37f 2261-38348 B_WX 63 NA 23.9
## 2 9XKm7o1 1991-46167 M_XP 90 NA 37.4
## 3 6J74507 2261-38347 XYG_ 67 NA 25.9
## 4 7E5T42L 1991-46166 XOM_ 93 NA 42.2
## 5 0551Y4q 2411-37708 X_PK 85 NA 33
## 6 FJ72332 2411-37710 ORX_ 91 NA 45
## 7 BD1A0kZ 2411-37709 BXP_ 87 NA 45
## 8 136wur8 2341-29331 X_YK NA NA 22
## 9 U8A8454 1202-99076 MXO_ 117 NA 91
## 10 Q15TLAw 1202-99075 WKX_ 139 NA 75
## # … with 6,825 more rows
The above could have also been written as:
captures %>%
select(
c(capture_id,
band_number,
color_combo,
wing:mass))
## # A tibble: 6,835 × 6
## capture_id band_number color_combo wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 9kIB37f 2261-38348 B_WX 63 NA 23.9
## 2 9XKm7o1 1991-46167 M_XP 90 NA 37.4
## 3 6J74507 2261-38347 XYG_ 67 NA 25.9
## 4 7E5T42L 1991-46166 XOM_ 93 NA 42.2
## 5 0551Y4q 2411-37708 X_PK 85 NA 33
## 6 FJ72332 2411-37710 ORX_ 91 NA 45
## 7 BD1A0kZ 2411-37709 BXP_ 87 NA 45
## 8 136wur8 2341-29331 X_YK NA NA 22
## 9 U8A8454 1202-99076 MXO_ 117 NA 91
## 10 Q15TLAw 1202-99075 WKX_ 139 NA 75
## # … with 6,825 more rows
Remove a single column: A negated selection will
return all columns except for those specified by the negation operator,
the !. Here, we’ll remove the column
band_number:
captures %>%
select(!band_number)
## # A tibble: 6,835 × 11
## capture_id visit_id color…¹ spp sex age wing tl mass bp_cp fat
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9kIB37f b1BQ67f46… B_WX SOSP M AHY 63 NA 23.9 <NA> <NA>
## 2 9XKm7o1 b1BQ67f46… M_XP GRCA M AHY 90 NA 37.4 <NA> <NA>
## 3 6J74507 b1BQ67f46… XYG_ SOSP M AHY 67 NA 25.9 <NA> <NA>
## 4 7E5T42L b1BQ67f46… XOM_ NOCA M AHY 93 NA 42.2 <NA> <NA>
## 5 0551Y4q 76569Hj4a… X_PK GRCA M AHY 85 NA 33 <NA> <NA>
## 6 FJ72332 76569Hj4a… ORX_ NOCA F AHY 91 NA 45 <NA> <NA>
## 7 BD1A0kZ 76569Hj4a… BXP_ GRCA M AHY 87 NA 45 <NA> <NA>
## 8 136wur8 76569Hj4a… X_YK CARW U HY NA NA 22 <NA> <NA>
## 9 U8A8454 76569Hj4a… MXO_ AMRO F AHY 117 NA 91 <NA> <NA>
## 10 Q15TLAw 76569Hj4a… WKX_ AMRO M AHY 139 NA 75 <NA> <NA>
## # … with 6,825 more rows, and abbreviated variable name ¹color_combo
Remove sets of columns: The negation operator can be used with multiple columns as well.
We can remove a set of adjacent columns using : to
specify a range of columns to remove:
captures %>%
select(!band_number:age)
## # A tibble: 6,835 × 7
## capture_id visit_id wing tl mass bp_cp fat
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9kIB37f b1BQ67f46a93580 63 NA 23.9 <NA> <NA>
## 2 9XKm7o1 b1BQ67f46a93580 90 NA 37.4 <NA> <NA>
## 3 6J74507 b1BQ67f46a93580 67 NA 25.9 <NA> <NA>
## 4 7E5T42L b1BQ67f46a93580 93 NA 42.2 <NA> <NA>
## 5 0551Y4q 76569Hj4a1dd400 85 NA 33 <NA> <NA>
## 6 FJ72332 76569Hj4a1dd400 91 NA 45 <NA> <NA>
## 7 BD1A0kZ 76569Hj4a1dd400 87 NA 45 <NA> <NA>
## 8 136wur8 76569Hj4a1dd400 NA NA 22 <NA> <NA>
## 9 U8A8454 76569Hj4a1dd400 117 NA 91 <NA> <NA>
## 10 Q15TLAw 76569Hj4a1dd400 139 NA 75 <NA> <NA>
## # … with 6,825 more rows
To remove a set of non-adjacent columns, we have to wrap our column
names in the combine function, c:
captures %>%
select(!c(band_number, age))
## # A tibble: 6,835 × 10
## capture_id visit_id color_…¹ spp sex wing tl mass bp_cp fat
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9kIB37f b1BQ67f46a93580 B_WX SOSP M 63 NA 23.9 <NA> <NA>
## 2 9XKm7o1 b1BQ67f46a93580 M_XP GRCA M 90 NA 37.4 <NA> <NA>
## 3 6J74507 b1BQ67f46a93580 XYG_ SOSP M 67 NA 25.9 <NA> <NA>
## 4 7E5T42L b1BQ67f46a93580 XOM_ NOCA M 93 NA 42.2 <NA> <NA>
## 5 0551Y4q 76569Hj4a1dd400 X_PK GRCA M 85 NA 33 <NA> <NA>
## 6 FJ72332 76569Hj4a1dd400 ORX_ NOCA F 91 NA 45 <NA> <NA>
## 7 BD1A0kZ 76569Hj4a1dd400 BXP_ GRCA M 87 NA 45 <NA> <NA>
## 8 136wur8 76569Hj4a1dd400 X_YK CARW U NA NA 22 <NA> <NA>
## 9 U8A8454 76569Hj4a1dd400 MXO_ AMRO F 117 NA 91 <NA> <NA>
## 10 Q15TLAw 76569Hj4a1dd400 WKX_ AMRO M 139 NA 75 <NA> <NA>
## # … with 6,825 more rows, and abbreviated variable name ¹color_combo
By value class: We can subset columns using a
logical test. For example, the below subsets the data to columns
where() the column class is numeric:
captures %>%
select(
where(is.numeric))
## # A tibble: 6,835 × 3
## wing tl mass
## <dbl> <dbl> <dbl>
## 1 63 NA 23.9
## 2 90 NA 37.4
## 3 67 NA 25.9
## 4 93 NA 42.2
## 5 85 NA 33
## 6 91 NA 45
## 7 87 NA 45
## 8 NA NA 22
## 9 117 NA 91
## 10 139 NA 75
## # … with 6,825 more rows
… and the below subsets the data to columns where() the
column class is character:
captures %>%
select(
where(is.character))
## # A tibble: 6,835 × 9
## capture_id visit_id band_number color_…¹ spp sex age bp_cp fat
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9kIB37f b1BQ67f46a93580 2261-38348 B_WX SOSP M AHY <NA> <NA>
## 2 9XKm7o1 b1BQ67f46a93580 1991-46167 M_XP GRCA M AHY <NA> <NA>
## 3 6J74507 b1BQ67f46a93580 2261-38347 XYG_ SOSP M AHY <NA> <NA>
## 4 7E5T42L b1BQ67f46a93580 1991-46166 XOM_ NOCA M AHY <NA> <NA>
## 5 0551Y4q 76569Hj4a1dd400 2411-37708 X_PK GRCA M AHY <NA> <NA>
## 6 FJ72332 76569Hj4a1dd400 2411-37710 ORX_ NOCA F AHY <NA> <NA>
## 7 BD1A0kZ 76569Hj4a1dd400 2411-37709 BXP_ GRCA M AHY <NA> <NA>
## 8 136wur8 76569Hj4a1dd400 2341-29331 X_YK CARW U HY <NA> <NA>
## 9 U8A8454 76569Hj4a1dd400 1202-99076 MXO_ AMRO F AHY <NA> <NA>
## 10 Q15TLAw 76569Hj4a1dd400 1202-99075 WKX_ AMRO M AHY <NA> <NA>
## # … with 6,825 more rows, and abbreviated variable name ¹color_combo
Partial string match: We can also search for strings
in columns names when subsetting using the contains()
function. Here, we subset the data to any column that contains the
string _id:
captures %>%
select(
contains('_id'))
## # A tibble: 6,835 × 2
## capture_id visit_id
## <chr> <chr>
## 1 9kIB37f b1BQ67f46a93580
## 2 9XKm7o1 b1BQ67f46a93580
## 3 6J74507 b1BQ67f46a93580
## 4 7E5T42L b1BQ67f46a93580
## 5 0551Y4q 76569Hj4a1dd400
## 6 FJ72332 76569Hj4a1dd400
## 7 BD1A0kZ 76569Hj4a1dd400
## 8 136wur8 76569Hj4a1dd400
## 9 U8A8454 76569Hj4a1dd400
## 10 Q15TLAw 76569Hj4a1dd400
## # … with 6,825 more rows
… and here we use the negation operator to subset to any column that
does not contain the string _id:
captures %>%
select(
!contains('_id'))
## # A tibble: 6,835 × 10
## band_number color_combo spp sex age wing tl mass bp_cp fat
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2261-38348 B_WX SOSP M AHY 63 NA 23.9 <NA> <NA>
## 2 1991-46167 M_XP GRCA M AHY 90 NA 37.4 <NA> <NA>
## 3 2261-38347 XYG_ SOSP M AHY 67 NA 25.9 <NA> <NA>
## 4 1991-46166 XOM_ NOCA M AHY 93 NA 42.2 <NA> <NA>
## 5 2411-37708 X_PK GRCA M AHY 85 NA 33 <NA> <NA>
## 6 2411-37710 ORX_ NOCA F AHY 91 NA 45 <NA> <NA>
## 7 2411-37709 BXP_ GRCA M AHY 87 NA 45 <NA> <NA>
## 8 2341-29331 X_YK CARW U HY NA NA 22 <NA> <NA>
## 9 1202-99076 MXO_ AMRO F AHY 117 NA 91 <NA> <NA>
## 10 1202-99075 WKX_ AMRO M AHY 139 NA 75 <NA> <NA>
## # … with 6,825 more rows
Using select: During the column selection process, we can change the name of a selected column:
captures %>%
select(band_number,
color_combo,
species = spp)
## # A tibble: 6,835 × 3
## band_number color_combo species
## <chr> <chr> <chr>
## 1 2261-38348 B_WX SOSP
## 2 1991-46167 M_XP GRCA
## 3 2261-38347 XYG_ SOSP
## 4 1991-46166 XOM_ NOCA
## 5 2411-37708 X_PK GRCA
## 6 2411-37710 ORX_ NOCA
## 7 2411-37709 BXP_ GRCA
## 8 2341-29331 X_YK CARW
## 9 1202-99076 MXO_ AMRO
## 10 1202-99075 WKX_ AMRO
## # … with 6,825 more rows
Using rename: If we do not need to subset the
columns, we may want to choose the rename() function
instead:
captures %>%
rename(species = spp)
## # A tibble: 6,835 × 12
## capture…¹ visit…² band_…³ color…⁴ species sex age wing tl mass bp_cp
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 9kIB37f b1BQ67… 2261-3… B_WX SOSP M AHY 63 NA 23.9 <NA>
## 2 9XKm7o1 b1BQ67… 1991-4… M_XP GRCA M AHY 90 NA 37.4 <NA>
## 3 6J74507 b1BQ67… 2261-3… XYG_ SOSP M AHY 67 NA 25.9 <NA>
## 4 7E5T42L b1BQ67… 1991-4… XOM_ NOCA M AHY 93 NA 42.2 <NA>
## 5 0551Y4q 76569H… 2411-3… X_PK GRCA M AHY 85 NA 33 <NA>
## 6 FJ72332 76569H… 2411-3… ORX_ NOCA F AHY 91 NA 45 <NA>
## 7 BD1A0kZ 76569H… 2411-3… BXP_ GRCA M AHY 87 NA 45 <NA>
## 8 136wur8 76569H… 2341-2… X_YK CARW U HY NA NA 22 <NA>
## 9 U8A8454 76569H… 1202-9… MXO_ AMRO F AHY 117 NA 91 <NA>
## 10 Q15TLAw 76569H… 1202-9… WKX_ AMRO M AHY 139 NA 75 <NA>
## # … with 6,825 more rows, 1 more variable: fat <chr>, and abbreviated variable
## # names ¹capture_id, ²visit_id, ³band_number, ⁴color_combo
Now you!
Subset captures to the fields band number, spp, sex, wing, tl (tail length), and mass. Assign the resultant object to your global environment with the key
measures.You can see my answer below by clicking the “Code” button. Give this an honest try before you do though!
measures <- captures %>% select(band_number, spp:sex, wing:mass)
A decision that you’ll sometimes have to make when subsetting is whether to subset by providing the names of your columns of interest or with negation. I typically choose the method that requires the least amount of coding.
Being able to re-arrange data frames can be super important when exploring tabular data or preparing the data to be plotted.
Reorder columns with select: We can use the
select() function to reorder columns while we select them
(for measures see “Now you!”, above):
measures %>%
select(spp,
band_number:mass)
## # A tibble: 6,835 × 6
## spp band_number sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 SOSP 2261-38348 M 63 NA 23.9
## 2 GRCA 1991-46167 M 90 NA 37.4
## 3 SOSP 2261-38347 M 67 NA 25.9
## 4 NOCA 1991-46166 M 93 NA 42.2
## 5 GRCA 2411-37708 M 85 NA 33
## 6 NOCA 2411-37710 F 91 NA 45
## 7 GRCA 2411-37709 M 87 NA 45
## 8 CARW 2341-29331 U NA NA 22
## 9 AMRO 1202-99076 F 117 NA 91
## 10 AMRO 1202-99075 M 139 NA 75
## # … with 6,825 more rows
If we have a lot of columns, or want to make our code more
generalize-able, we can use the everything() function. The
code below says to select spp and then every other column
after that.
measures %>%
select(spp, everything())
## # A tibble: 6,835 × 6
## spp band_number sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 SOSP 2261-38348 M 63 NA 23.9
## 2 GRCA 1991-46167 M 90 NA 37.4
## 3 SOSP 2261-38347 M 67 NA 25.9
## 4 NOCA 1991-46166 M 93 NA 42.2
## 5 GRCA 2411-37708 M 85 NA 33
## 6 NOCA 2411-37710 F 91 NA 45
## 7 GRCA 2411-37709 M 87 NA 45
## 8 CARW 2341-29331 U NA NA 22
## 9 AMRO 1202-99076 F 117 NA 91
## 10 AMRO 1202-99075 M 139 NA 75
## # … with 6,825 more rows
Move the location of a column: We can also just move
a column to the front using the relocate() function:
measures %>%
relocate(spp)
## # A tibble: 6,835 × 6
## spp band_number sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 SOSP 2261-38348 M 63 NA 23.9
## 2 GRCA 1991-46167 M 90 NA 37.4
## 3 SOSP 2261-38347 M 67 NA 25.9
## 4 NOCA 1991-46166 M 93 NA 42.2
## 5 GRCA 2411-37708 M 85 NA 33
## 6 NOCA 2411-37710 F 91 NA 45
## 7 GRCA 2411-37709 M 87 NA 45
## 8 CARW 2341-29331 U NA NA 22
## 9 AMRO 1202-99076 F 117 NA 91
## 10 AMRO 1202-99075 M 139 NA 75
## # … with 6,825 more rows
Arranging data frame by row, using the arrange()
function can provide a powerful tool for data frame exploration.
Arrange from lowest to highest values: By default,
arrange() sorts a data frame from the lowest to highest
value in a specified column. The following arranges
measures from lowest to highest mass
measurement:
measures %>%
arrange(mass)
## # A tibble: 6,835 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-41582 CACH U 62 56 6.3
## 2 2460-22927 CACH F 60 NA 7
## 3 2750-44514 HOWR U 26.5 NA 7.1
## 4 2810-03340 HOWR M 49.5 40 7.1
## 5 2430-29237 CACH U 63 NA 7.2
## 6 2250-24271 CACH F NA NA 7.3
## 7 2750-41580 HOWR M 51 44 7.4
## 8 2750-41591 CACH U 58 56 7.5
## 9 2810-03341 HOWR F 47 44 7.5
## 10 2590-82234 CACH U 58 NA 7.6
## # … with 6,825 more rows
Arrange from highest to lowest values: We can
arrange a data frame in descending order using the desc()
function nested inside of arrange():
measures %>%
arrange(
desc(mass))
## # A tibble: 6,835 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1352-46513 AMRO F 125 NA 111.
## 2 1573-45513 COGR F 125 106. 110.
## 3 1573-45514 COGR F 125 111 107.
## 4 1342-72973 AMRO M 124 NA 102
## 5 1272-43002 AMRO F 121 NA 101
## 6 1352-46575 AMRO F 121 93 98.7
## 7 1272-43091 AMRO F 127 NA 97.8
## 8 1352-46269 AMRO F 125 NA 96
## 9 1232-94730 AMRO F 121 NA 95.1
## 10 1232-94720 AMRO M 128 NA 94.5
## # … with 6,825 more rows
We often want to convert data frames from wide to long format. This is often an important operation in data tidying (because many data frames built in Excel are an untidy wide format). It also can be useful when we want to plot several variables in a single plot (e.g., with variables mapped to different facts of a plot or a color aesthetic mapped to a given variable).
To do so, we use pivot_longer. We supply the data frame
we are pivoting (here that data frame is piped in), the selection of
columns to pivot, the name for the variable classes (from the column
names), and the name of the value column.
measures %>%
pivot_longer(
wing:mass,
names_to = 'measurement',
values_to = 'value')
## # A tibble: 20,505 × 5
## band_number spp sex measurement value
## <chr> <chr> <chr> <chr> <dbl>
## 1 2261-38348 SOSP M wing 63
## 2 2261-38348 SOSP M tl NA
## 3 2261-38348 SOSP M mass 23.9
## 4 1991-46167 GRCA M wing 90
## 5 1991-46167 GRCA M tl NA
## 6 1991-46167 GRCA M mass 37.4
## 7 2261-38347 SOSP M wing 67
## 8 2261-38347 SOSP M tl NA
## 9 2261-38347 SOSP M mass 25.9
## 10 1991-46166 NOCA M wing 93
## # … with 20,495 more rows
We can also pivot to a wider data frame. This is sometimes useful in data tidying (when multiple variables are stored in one column), but is most often used when preparing data for analysis or when you need to visually explore tabular data.
To pivot a data frame to a wider format we use the function
pivot_wider(). Here, we supply the data frame, the name of
the column that contains our variables of interest.
We’re going to use the counts data frame for this
(the above has a problem that we’ll address in the next
segment) to create a wide frame table of counts by visit.
counts %>%
select(-count_id) %>%
pivot_wider(
names_from = spp,
values_from = count) %>%
select(visit_id:NOCA)
## # A tibble: 316 × 5
## visit_id AMRO GRCA MODO NOCA
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3Fxh65U4fb43f80 2 2 1 2
## 2 8iXAa9851afd100 1 1 1 4
## 3 150jN34577eed00 1 NA NA 4
## 4 0n075TL4ff23600 4 NA 2 2
## 5 G686H0051edc780 3 1 1 4
## 6 jrS1z12576dc980 1 NA NA 1
## 7 0dg6s68592e0780 6 1 NA 2
## 8 q1fkjW85b15d280 5 NA NA NA
## 9 24tsYn85ce33f80 3 NA NA 8
## 10 Nbq2mK25ed6e800 9 1 NA 2
## # … with 306 more rows
Note: I included the select() function at the end
just so this didn’t make a mess when it printed.
This method can be used to quickly format data for many common
community analyses. Of course, all of those NA values in there should
really be zeros (the species was not observed during the visit). We can
add the values_fill argument to pivot_wider()
to replace those values during pivoting.
counts %>%
select(-count_id) %>%
pivot_wider(
names_from = spp,
values_from = count,
values_fill = 0) %>%
select(visit_id:NOCA)
## # A tibble: 316 × 5
## visit_id AMRO GRCA MODO NOCA
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3Fxh65U4fb43f80 2 2 1 2
## 2 8iXAa9851afd100 1 1 1 4
## 3 150jN34577eed00 1 0 0 4
## 4 0n075TL4ff23600 4 0 2 2
## 5 G686H0051edc780 3 1 1 4
## 6 jrS1z12576dc980 1 0 0 1
## 7 0dg6s68592e0780 6 1 0 2
## 8 q1fkjW85b15d280 5 0 0 0
## 9 24tsYn85ce33f80 3 0 0 8
## 10 Nbq2mK25ed6e800 9 1 0 2
## # … with 306 more rows
The most straightforward (and sometimes useful) way to subset the rows of a data frame is to extract rows by position.
Slice a row by position: Using the function
slice(), we simply provide the row number that we would
like to retrieve:
measures %>%
slice(1)
## # A tibble: 1 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348 SOSP M 63 NA 23.9
Slice adjacent rows by position: To slice adjacent
rows, we provide a vector of row numbers, with lower and upper values
separated with a : operator:
measures %>%
slice(1:5)
## # A tibble: 5 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348 SOSP M 63 NA 23.9
## 2 1991-46167 GRCA M 90 NA 37.4
## 3 2261-38347 SOSP M 67 NA 25.9
## 4 1991-46166 NOCA M 93 NA 42.2
## 5 2411-37708 GRCA M 85 NA 33
Slice non-adjacent rows by position: For
non-adjacent rows, we supply our vector of rows using the combine
function, c():
measures %>%
slice(
c(2, 3, 5))
## # A tibble: 3 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1991-46167 GRCA M 90 NA 37.4
## 2 2261-38347 SOSP M 67 NA 25.9
## 3 2411-37708 GRCA M 85 NA 33
Subset the data to the top row(s): The function
slice_head() can be a handy tool for just grabbing the top
row:
measures %>%
slice_head()
## # A tibble: 1 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348 SOSP M 63 NA 23.9
… or a specified number of rows from the top of the data frame:
measures %>%
slice_head(n = 5)
## # A tibble: 5 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348 SOSP M 63 NA 23.9
## 2 1991-46167 GRCA M 90 NA 37.4
## 3 2261-38347 SOSP M 67 NA 25.9
## 4 1991-46166 NOCA M 93 NA 42.2
## 5 2411-37708 GRCA M 85 NA 33
Subset the data to the last row(s): The opposite of
slice_head() is slice_tail(), which can be
used to grab the last row in a data frame:
measures %>%
slice_tail()
## # A tibble: 1 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-44542 CACH U 59 49 9.1
… or the last several rows in a data frame:
measures %>%
slice_tail(n = 5)
## # A tibble: 5 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-44538 CACH U 57 48 8.3
## 2 2750-44541 CACH U 60.5 52.5 9.2
## 3 2750-44543 CACH U 59 49 NA
## 4 2750-44540 CACH F 59.5 50 9.3
## 5 2750-44542 CACH U 59 49 9.1
Slice and arrange: The slice()
functions becomes a very powerful tool for data exploration and
wrangling when used in conjunction with arrange(). For
example, perhaps we want to see the five lowest mass values in the
measures data:
measures %>%
arrange(mass) %>%
slice_head(n = 5)
## # A tibble: 5 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-41582 CACH U 62 56 6.3
## 2 2460-22927 CACH F 60 NA 7
## 3 2750-44514 HOWR U 26.5 NA 7.1
## 4 2810-03340 HOWR M 49.5 40 7.1
## 5 2430-29237 CACH U 63 NA 7.2
When you group a data frame, using the function
group_by(), you split the data by a variable:
measures %>%
group_by(spp)
## # A tibble: 6,835 × 6
## # Groups: spp [23]
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2261-38348 SOSP M 63 NA 23.9
## 2 1991-46167 GRCA M 90 NA 37.4
## 3 2261-38347 SOSP M 67 NA 25.9
## 4 1991-46166 NOCA M 93 NA 42.2
## 5 2411-37708 GRCA M 85 NA 33
## 6 2411-37710 NOCA F 91 NA 45
## 7 2411-37709 GRCA M 87 NA 45
## 8 2341-29331 CARW U NA NA 22
## 9 1202-99076 AMRO F 117 NA 91
## 10 1202-99075 AMRO M 139 NA 75
## # … with 6,825 more rows
It doesn’t look like anything happened but it did. Notice that, just
above the tibble, you can see that there are now groups
(Groups: spp [23]).
Each function run after the group_by() function will run
on each group separately. This can be very useful for subsetting data
frames.
For example, we can extract the three lowest mass values for each species with:
measures %>%
arrange(mass) %>%
group_by(spp) %>%
slice_head(n = 2)
## # A tibble: 41 × 6
## # Groups: spp [23]
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2660-67607 AMGO F 66 45 11.3
## 2 2750-44551 AMGO M 70 46 12.2
## 3 0962-47742 AMRO U NA NA 38.8
## 4 0742-29601 AMRO U 105 NA 45.4
## 5 2711-77728 BHCO U 40.5 14 23
## 6 2711-77729 BHCO U 40.4 15.6 24
## 7 1232-45256 BLJA U 131 133 89.2
## 8 2750-41582 CACH U 62 56 6.3
## 9 2460-22927 CACH F 60 NA 7
## 10 2571-90110 CARW U 56 NA 8
## # … with 31 more rows
Whenever you use group_by(), you should always ungroup the
data afterwards or else all subsequent functions will run on the grouped
data!
measures %>%
arrange(mass) %>%
group_by(spp) %>%
slice_head(n = 3) %>%
ungroup()
## # A tibble: 57 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2660-67607 AMGO F 66 45 11.3
## 2 2750-44551 AMGO M 70 46 12.2
## 3 2750-44550 AMGO F 68.5 44 12.7
## 4 0962-47742 AMRO U NA NA 38.8
## 5 0742-29601 AMRO U 105 NA 45.4
## 6 0742-29427 AMRO U 103 NA 48.9
## 7 2711-77728 BHCO U 40.5 14 23
## 8 2711-77729 BHCO U 40.4 15.6 24
## 9 2661-31502 BHCO F 92 61 35.3
## 10 1232-45256 BLJA U 131 133 89.2
## # … with 47 more rows
Now you!
Individuals may have been captured and measured more than once. The band number is a unique identifier for each individual. Subset the data to the first record for each individual. Assign to your global environment with the name first_measures.
first_measures <- measures %>% group_by(band_number) %>% slice_head() %>% ungroup()
We can subset rows based on whether values satisfy a given condition.
This is called filtering and is achieved using the
filter() function.
Filter by one variable: To filter based on
condition, we provide the name of the target column, a relational
operator (see ?Comparison), and the condition upon which to
filter the data.
For example, in the below, I filter first_measures (see
“Now you!”, above) to rows where the wing value is greater than 80
mm:
first_measures %>%
filter(wing > 80) %>%
arrange(wing)
## # A tibble: 2,950 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1141-05732 TUTI F 80.5 69.5 20.4
## 2 0791-17253 GRCA M 81 NA 45.2
## 3 2301-56014 GRCA F 81 NA 35
## 4 2421-04527 NOCA M 81 88 36.2
## 5 2421-11347 NOCA U 81 NA 37.3
## 6 2711-77545 TUTI F 81 84 21.2
## 7 2711-77571 TUTI U 81 75 24.6
## 8 2711-77573 TUTI U 81 75 22.5
## 9 2711-77582 TUTI U 81 68.5 21.7
## 10 2711-77733 TUTI M 81 72.5 20.9
## # … with 2,940 more rows
… and here to where the value is greater than or equal to 80 mm:
first_measures %>%
filter(wing >= 80) %>%
arrange(wing)
## # A tibble: 2,962 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1141-05731 TUTI M 80 70 21.7
## 2 1141-05736 TUTI F 80 73 19.5
## 3 1232-87932 AMRO U 80 NA 50.2
## 4 2301-56561 NOCA F 80 NA NA
## 5 2301-56929 GRCA M 80 NA 31.5
## 6 2341-03957 TUTI F 80 70 22.3
## 7 2411-37737 NOCA U 80 NA 41
## 8 2641-63973 NOCA U 80 NA 39
## 9 2711-77543 TUTI F 80 69 19.9
## 10 2711-77583 TUTI U 80 70 21.1
## # … with 2,952 more rows
We can also filter based on a character value. Below, I filter to
where the value in the species column (spp) is “GRCA” (Gray
catbird):
first_measures %>%
filter(spp == 'GRCA')
## # A tibble: 1,266 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17029 GRCA M 90.5 99 34.7
## 2 0791-17030 GRCA M 90 95 33.4
## 3 0791-17032 GRCA M 93.5 101 37.4
## 4 0791-17033 GRCA M NA NA 36.4
## 5 0791-17045 GRCA M 90 94 35
## 6 0791-17048 GRCA F 90.5 102 42.7
## 7 0791-17057 GRCA M 92.5 97.5 38.9
## 8 0791-17058 GRCA M 95 99.5 37.2
## 9 0791-17068 GRCA M 87 93 38.5
## 10 0791-17074 GRCA M 94.5 100 38.9
## # … with 1,256 more rows
A word of warning! When you filter a data frame based
on a condition, all NA values will be dropped! That can
lead to unexpected behavior if you’re not careful.
It’s often useful to supply a derived summary statistic as a
condition. For example, below I filter first_measures to
where the value of the wing column is equal to the maximum
wing length value:
first_measures %>%
filter(
wing ==
max(wing,
na.rm = TRUE))
## # A tibble: 1 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1202-99075 AMRO M 139 NA 75
… and here to where the wing value is equal to the
minimum wing length value:
first_measures %>%
filter(
wing ==
min(wing,
na.rm = TRUE))
## # A tibble: 2 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2750-64434 HOWR U 21.5 NA 8.2
## 2 2750-64435 HOWR U 21.5 NA 8.6
It is
often necessary to include na.rm = TRUE when conducting
statistical summaries. Otherwise, summaries of columns that contain
missing values will return a value of NA.
Negation is a powerful tool for filtering. Using the !=
operator, we can filter values that are NOT equal to the specified
condition:
first_measures %>%
filter(spp != 'GRCA')
## # A tibble: 5,113 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29420 AMRO M 126 NA 73.4
## 8 0742-29421 AMRO M 129 NA 70.2
## 9 0742-29422 AMRO F 119 NA 81.6
## 10 0742-29423 AMRO F 122 NA 77.4
## # … with 5,103 more rows
We can also use the logical negation operator itself, !,
to negate a filtering statement:
first_measures %>%
filter(!spp == 'GRCA')
## # A tibble: 5,113 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29420 AMRO M 126 NA 73.4
## 8 0742-29421 AMRO M 129 NA 70.2
## 9 0742-29422 AMRO F 119 NA 81.6
## 10 0742-29423 AMRO F 122 NA 77.4
## # … with 5,103 more rows
We sometimes want to remove NA values from our data. We can do this
with the is.na() function. This function is a logical test
of whether a given value is NA. Since a logical test returns a value or
TRUE or FALSE, it is unnecessary to supply relational operators.
Here, I’ll filter first_measures to where
tl (tail length) values are NA:
first_measures %>%
filter(is.na(tl))
## # A tibble: 4,392 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29420 AMRO M 126 NA 73.4
## 8 0742-29421 AMRO M 129 NA 70.2
## 9 0742-29422 AMRO F 119 NA 81.6
## 10 0742-29423 AMRO F 122 NA 77.4
## # … with 4,382 more rows
We can filter to values that are not NA using the
negation operator, !:
first_measures %>%
filter(!is.na(tl))
## # A tibble: 1,987 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29438 AMRO M 130 100 71.4
## 2 0742-29489 AMRO F 126 103 81.9
## 3 0742-29490 AMRO M 128 100 73.7
## 4 0742-29702 AMRO F 117 88 72.8
## 5 0791-17014 NOCA M 91.5 96.5 41
## 6 0791-17018 NOCA M 99.5 105 45.4
## 7 0791-17019 NOCA F 91.5 92 41.3
## 8 0791-17020 NOCA M 90 23.9 41.7
## 9 0791-17021 NOMO F 103 111 54.1
## 10 0791-17025 NOCA M 91 96 43.2
## # … with 1,977 more rows
AND: When we want to filter based on whether a given value or values match multiple conditions, we can chain together sets of filtering statements:
first_measures %>%
filter(wing > 80) %>%
filter(wing < 90)
## # A tibble: 1,086 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17068 GRCA M 87 93 38.5
## 2 0791-17078 GRCA M 89 95.5 37.3
## 3 0791-17211 GRCA F 86 NA 40.8
## 4 0791-17212 GRCA M 83 NA 34.6
## 5 0791-17214 GRCA F 89.5 NA 33.3
## 6 0791-17222 GRCA F 88.5 NA 32.9
## 7 0791-17223 GRCA M 86 NA 37.9
## 8 0791-17239 GRCA F 89 NA 43.3
## 9 0791-17248 GRCA M 88 NA 37.5
## 10 0791-17249 GRCA F 88 NA 40.3
## # … with 1,076 more rows
… or much more parsimoniously, separate filtering
statements with the logical operator & (see
?base::Logic):
first_measures %>%
filter(wing > 80 &
wing < 90)
## # A tibble: 1,086 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17068 GRCA M 87 93 38.5
## 2 0791-17078 GRCA M 89 95.5 37.3
## 3 0791-17211 GRCA F 86 NA 40.8
## 4 0791-17212 GRCA M 83 NA 34.6
## 5 0791-17214 GRCA F 89.5 NA 33.3
## 6 0791-17222 GRCA F 88.5 NA 32.9
## 7 0791-17223 GRCA M 86 NA 37.9
## 8 0791-17239 GRCA F 89 NA 43.3
## 9 0791-17248 GRCA M 88 NA 37.5
## 10 0791-17249 GRCA F 88 NA 40.3
## # … with 1,076 more rows
… or, even more parsimoniously, simply separate the two filters with a comma:
first_measures %>%
filter(wing > 80,
wing < 90)
## # A tibble: 1,086 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17068 GRCA M 87 93 38.5
## 2 0791-17078 GRCA M 89 95.5 37.3
## 3 0791-17211 GRCA F 86 NA 40.8
## 4 0791-17212 GRCA M 83 NA 34.6
## 5 0791-17214 GRCA F 89.5 NA 33.3
## 6 0791-17222 GRCA F 88.5 NA 32.9
## 7 0791-17223 GRCA M 86 NA 37.9
## 8 0791-17239 GRCA F 89 NA 43.3
## 9 0791-17248 GRCA M 88 NA 37.5
## 10 0791-17249 GRCA F 88 NA 40.3
## # … with 1,076 more rows
While
less parsimonious, chained filter statements can be less
memory-intensive than using & or ,. You
won’t recognize a difference until your data are very
large (like millions of records).
OR: In the above, each of the conditions of the filter must be met. While that’s often our goal, sometimes we want to filter the data to where either conditions are met.
Let’s see what happens when we try to filter the data to Gray catbird (GRCA) AND Northern cardinal (NOCA):
first_measures %>%
filter(spp == 'GRCA',
spp == 'NOCA')
## # A tibble: 0 × 6
## # … with 6 variables: band_number <chr>, spp <chr>, sex <chr>, wing <dbl>,
## # tl <dbl>, mass <dbl>
This returns no rows, because no species is both Gray catbird AND Northern cardinal!
To address this, we can use the OR operator,
|:
first_measures %>%
filter(spp == 'GRCA'|spp == 'NOCA')
## # A tibble: 2,512 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17014 NOCA M 91.5 96.5 41
## 2 0791-17018 NOCA M 99.5 105 45.4
## 3 0791-17019 NOCA F 91.5 92 41.3
## 4 0791-17020 NOCA M 90 23.9 41.7
## 5 0791-17025 NOCA M 91 96 43.2
## 6 0791-17027 NOCA M 94 NA 43
## 7 0791-17029 GRCA M 90.5 99 34.7
## 8 0791-17030 GRCA M 90 95 33.4
## 9 0791-17031 NOCA M 95.5 107 43.5
## 10 0791-17032 GRCA M 93.5 101 37.4
## # … with 2,502 more rows
… or, more parsimoniously, use the %in% operator. This
will extract a value that’s within a vector of provided values:
first_measures %>%
filter(spp %in% c('GRCA', 'NOCA'))
## # A tibble: 2,512 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17014 NOCA M 91.5 96.5 41
## 2 0791-17018 NOCA M 99.5 105 45.4
## 3 0791-17019 NOCA F 91.5 92 41.3
## 4 0791-17020 NOCA M 90 23.9 41.7
## 5 0791-17025 NOCA M 91 96 43.2
## 6 0791-17027 NOCA M 94 NA 43
## 7 0791-17029 GRCA M 90.5 99 34.7
## 8 0791-17030 GRCA M 90 95 33.4
## 9 0791-17031 NOCA M 95.5 107 43.5
## 10 0791-17032 GRCA M 93.5 101 37.4
## # … with 2,502 more rows
As with the filtering operations above, this can also be used with multiple logical conditions:
first_measures %>%
filter(spp %in% c('GRCA', 'NOCA'),
sex == 'F')
## # A tibble: 785 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17019 NOCA F 91.5 92 41.3
## 2 0791-17044 NOCA F 92 95 42.1
## 3 0791-17046 NOCA F 93 94.5 47.8
## 4 0791-17048 GRCA F 90.5 102 42.7
## 5 0791-17064 NOCA F 94 92 42.2
## 6 0791-17079 GRCA F 90 92.5 38.3
## 7 0791-17211 GRCA F 86 NA 40.8
## 8 0791-17214 GRCA F 89.5 NA 33.3
## 9 0791-17215 NOCA F 92 NA 43.6
## 10 0791-17222 GRCA F 88.5 NA 32.9
## # … with 775 more rows
You can continue to add filtering statements using a single variable, or multiple variables. For example, we can filter by wing length and subset the data to where the species is Gray catbird (GRCA) with this chained analysis:
first_measures %>%
filter(wing > 80) %>%
filter(wing < 90) %>%
filter(spp == 'GRCA')
## # A tibble: 698 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17068 GRCA M 87 93 38.5
## 2 0791-17078 GRCA M 89 95.5 37.3
## 3 0791-17211 GRCA F 86 NA 40.8
## 4 0791-17212 GRCA M 83 NA 34.6
## 5 0791-17214 GRCA F 89.5 NA 33.3
## 6 0791-17222 GRCA F 88.5 NA 32.9
## 7 0791-17223 GRCA M 86 NA 37.9
## 8 0791-17239 GRCA F 89 NA 43.3
## 9 0791-17248 GRCA M 88 NA 37.5
## 10 0791-17249 GRCA F 88 NA 40.3
## # … with 688 more rows
… or within a single function (way more parsimonious):
first_measures %>%
filter(wing > 80,
wing < 90,
spp == 'GRCA')
## # A tibble: 698 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0791-17068 GRCA M 87 93 38.5
## 2 0791-17078 GRCA M 89 95.5 37.3
## 3 0791-17211 GRCA F 86 NA 40.8
## 4 0791-17212 GRCA M 83 NA 34.6
## 5 0791-17214 GRCA F 89.5 NA 33.3
## 6 0791-17222 GRCA F 88.5 NA 32.9
## 7 0791-17223 GRCA M 86 NA 37.9
## 8 0791-17239 GRCA F 89 NA 43.3
## 9 0791-17248 GRCA M 88 NA 37.5
## 10 0791-17249 GRCA F 88 NA 40.3
## # … with 688 more rows
If you want to apply the same filter across multiple variables, we
can add the functions if_any() or if_all() to
our filtering statements.
OR: We can test if one of a selected set of
variables satisfies a logical condition with the if_any()
function. We supply a vector of variables to test, supply a formula
(with the ~ operator, see ?tilde), a filtering
statement, and specify the location of the variable in our filtering
statement with .x.
In the below, I filter to values where either the wing or
tail length (tl) is NA:
first_measures %>%
filter(
if_any(
wing:tl,
~is.na(.x)
))
## # A tibble: 4,546 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29420 AMRO M 126 NA 73.4
## 8 0742-29421 AMRO M 129 NA 70.2
## 9 0742-29422 AMRO F 119 NA 81.6
## 10 0742-29423 AMRO F 122 NA 77.4
## # … with 4,536 more rows
… We can subset the data to rows where neither wing nor tail are values are missing using the logical negation operator:
first_measures %>%
filter(
if_any(
wing:tl,
~!is.na(.x)
))
## # A tibble: 6,050 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4
## 2 0742-29421 AMRO M 129 NA 70.2
## 3 0742-29422 AMRO F 119 NA 81.6
## 4 0742-29423 AMRO F 122 NA 77.4
## 5 0742-29426 AMRO U 108 NA 64.4
## 6 0742-29427 AMRO U 103 NA 48.9
## 7 0742-29428 AMRO F 115 NA 76.1
## 8 0742-29431 AMRO F 119 NA 72.4
## 9 0742-29432 AMRO M 130 NA 71
## 10 0742-29433 AMRO M 122 NA 73.5
## # … with 6,040 more rows
AND: We can use if_all() to test
whether both values satisfies a logical condition. The construction of
the function is equivalent to that of if_any().
Here, we’ll subset to rows where both wing and tail values are missing:
first_measures %>%
filter(
if_all(
wing:tl,
~is.na(.x)
))
## # A tibble: 329 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29425 AMRO U NA NA 65.2
## 8 0742-29762 AMRO U NA NA NA
## 9 0742-29763 AMRO U NA NA NA
## 10 0791-17033 GRCA M NA NA 36.4
## # … with 319 more rows
… and subset to rows where both wing and tail values are not
NA:
first_measures %>%
filter(
if_all(
wing:tl,
~!is.na(.x)
))
## # A tibble: 1,833 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29438 AMRO M 130 100 71.4
## 2 0742-29489 AMRO F 126 103 81.9
## 3 0742-29490 AMRO M 128 100 73.7
## 4 0742-29702 AMRO F 117 88 72.8
## 5 0791-17014 NOCA M 91.5 96.5 41
## 6 0791-17018 NOCA M 99.5 105 45.4
## 7 0791-17019 NOCA F 91.5 92 41.3
## 8 0791-17020 NOCA M 90 23.9 41.7
## 9 0791-17021 NOMO F 103 111 54.1
## 10 0791-17025 NOCA M 91 96 43.2
## # … with 1,823 more rows
Before applying if_any() or if_all() to your
own data, be sure to understand how the functions are subsetting the
data. It’s a good idea to test the functions on a small set of dummy
data prior to applying it to a full dataset.
We can also apply a filter to grouped data. Recall the
group_by() function splits the data into groups and
subsequent functions applied to grouped data will be applied to each
group separately.
For example, let’s subset the data to the minimum mass
for female and male birds:
first_measures %>%
filter(sex %in% c('F', 'M')) %>%
group_by(sex) %>%
filter(
wing == min(wing, na.rm = TRUE))
## # A tibble: 2 × 6
## # Groups: sex [2]
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2250-24175 HOWR M 42 NA 10.5
## 2 2870-37309 HOWR F 42 42 10.8
We see that the lightest female bird captured was a CACH (Carolina chickadee) and the lighted male bird captured was a HOWR (House wren).
Counting observations in each group: Grouped filters
can be used to subset the data to only groups with a certain number of
observations. To do this, we use the group size function,
n(), to count the number of observations in each group.
Here, we’ll subset the data to only species with 100 or more records:
first_measures %>%
group_by(spp) %>%
filter(n() >= 100) %>%
ungroup()
## # A tibble: 6,207 × 6
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29174 AMRO U NA NA NA
## 2 0742-29175 AMRO U NA NA NA
## 3 0742-29176 AMRO U NA NA NA
## 4 0742-29177 AMRO U NA NA NA
## 5 0742-29178 AMRO U NA NA NA
## 6 0742-29179 AMRO U NA NA NA
## 7 0742-29420 AMRO M 126 NA 73.4
## 8 0742-29421 AMRO M 129 NA 70.2
## 9 0742-29422 AMRO F 119 NA 81.6
## 10 0742-29423 AMRO F 122 NA 77.4
## # … with 6,197 more rows
Important! If we intend to use these data
further, it’s necessary to ungroup() the data.
Now you!
Filter to birds where neither mass nor wing are NA and there are at least 50 individuals measured per species. Assign the object to your global environment with the name measures_subset.
measures_subset <- first_measures %>% filter( if_all( c(wing, mass), ~!is.na(.x))) %>% group_by(spp) %>% filter(n() > 100) %>% ungroup()
The mutate() family of functions was covered in our
previous Custom functions … lesson, but I’m including some
brief material on mutation here for completeness.
Modify or add a column with mutate: The
mutate() function can be used to modify an existing column
or columns.
For example, we can calculate the ratio between the mass
and wing of a bird and add it as a new column named
mass_wing (*Note: For measures_subset see “Now
you!”, above):
measures_subset %>%
mutate(mass_wing = mass/wing)
## # A tibble: 5,609 × 7
## band_number spp sex wing tl mass mass_wing
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4 0.583
## 2 0742-29421 AMRO M 129 NA 70.2 0.544
## 3 0742-29422 AMRO F 119 NA 81.6 0.686
## 4 0742-29423 AMRO F 122 NA 77.4 0.634
## 5 0742-29426 AMRO U 108 NA 64.4 0.596
## 6 0742-29427 AMRO U 103 NA 48.9 0.475
## 7 0742-29428 AMRO F 115 NA 76.1 0.662
## 8 0742-29431 AMRO F 119 NA 72.4 0.608
## 9 0742-29432 AMRO M 130 NA 71 0.546
## 10 0742-29433 AMRO M 122 NA 73.5 0.602
## # … with 5,599 more rows
I should note that adding that variable name is important because we can end up with some ugly (and difficult to use) column names without it:
measures_subset %>%
mutate(mass/wing)
## # A tibble: 5,609 × 7
## band_number spp sex wing tl mass `mass/wing`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4 0.583
## 2 0742-29421 AMRO M 129 NA 70.2 0.544
## 3 0742-29422 AMRO F 119 NA 81.6 0.686
## 4 0742-29423 AMRO F 122 NA 77.4 0.634
## 5 0742-29426 AMRO U 108 NA 64.4 0.596
## 6 0742-29427 AMRO U 103 NA 48.9 0.475
## 7 0742-29428 AMRO F 115 NA 76.1 0.662
## 8 0742-29431 AMRO F 119 NA 72.4 0.608
## 9 0742-29432 AMRO M 130 NA 71 0.546
## 10 0742-29433 AMRO M 122 NA 73.5 0.602
## # … with 5,599 more rows
It might seem obvious, but mutate() divided the values
in the mass column by the values in the wing
column. Let’s have a look at what that means though. Both columns are a
vector or the same length. In R, when you divide one vector by another,
the calculation is conducted by position. You can verify this
with the following calculation:
1:3/4:6
## [1] 0.25 0.40 0.50
In the above, the value 1 is divided by 4, 2 is divided by 5, and 3 is divided by 6.
We can also use mutate to add new data to the data frame. The only rule is that the added vector (i.e., column) must be the same length as the data frame.
For example, let’s add an id column, which will
basically just be the row number (Note: This doesn’t make a good
primary key though!)
measures_subset %>%
mutate(id = 1:nrow(.))
## # A tibble: 5,609 × 7
## band_number spp sex wing tl mass id
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 0742-29420 AMRO M 126 NA 73.4 1
## 2 0742-29421 AMRO M 129 NA 70.2 2
## 3 0742-29422 AMRO F 119 NA 81.6 3
## 4 0742-29423 AMRO F 122 NA 77.4 4
## 5 0742-29426 AMRO U 108 NA 64.4 5
## 6 0742-29427 AMRO U 103 NA 48.9 6
## 7 0742-29428 AMRO F 115 NA 76.1 7
## 8 0742-29431 AMRO F 119 NA 72.4 8
## 9 0742-29432 AMRO M 130 NA 71 9
## 10 0742-29433 AMRO M 122 NA 73.5 10
## # … with 5,599 more rows
I like to use the function row_number() for this
task:
measures_subset %>%
mutate(id = row_number())
## # A tibble: 5,609 × 7
## band_number spp sex wing tl mass id
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 0742-29420 AMRO M 126 NA 73.4 1
## 2 0742-29421 AMRO M 129 NA 70.2 2
## 3 0742-29422 AMRO F 119 NA 81.6 3
## 4 0742-29423 AMRO F 122 NA 77.4 4
## 5 0742-29426 AMRO U 108 NA 64.4 5
## 6 0742-29427 AMRO U 103 NA 48.9 6
## 7 0742-29428 AMRO F 115 NA 76.1 7
## 8 0742-29431 AMRO F 119 NA 72.4 8
## 9 0742-29432 AMRO M 130 NA 71 9
## 10 0742-29433 AMRO M 122 NA 73.5 10
## # … with 5,599 more rows
The transmute() function works similarly to mutate, but
it allows us to add and subset columns in a data frame in one step.
Let’s see what happens when we use transmute with our
mass_wing calculation.
measures_subset %>%
transmute(mass_wing = mass/wing)
## # A tibble: 5,609 × 1
## mass_wing
## <dbl>
## 1 0.583
## 2 0.544
## 3 0.686
## 4 0.634
## 5 0.596
## 6 0.475
## 7 0.662
## 8 0.608
## 9 0.546
## 10 0.602
## # … with 5,599 more rows
This dropped all of the other columns!
What if we wanted to do the equivalent of the below, where we conduct
our mutate and then subset the columns to species and
mass_wing?
measures_subset %>%
mutate(mass_wing = mass/wing) %>%
select(spp, mass_wing)
## # A tibble: 5,609 × 2
## spp mass_wing
## <chr> <dbl>
## 1 AMRO 0.583
## 2 AMRO 0.544
## 3 AMRO 0.686
## 4 AMRO 0.634
## 5 AMRO 0.596
## 6 AMRO 0.475
## 7 AMRO 0.662
## 8 AMRO 0.608
## 9 AMRO 0.546
## 10 AMRO 0.602
## # … with 5,599 more rows
To do this with transmute(), we can simply add
spp as an additional column:
measures_subset %>%
transmute(
spp,
mass_wing = mass/wing)
## # A tibble: 5,609 × 2
## spp mass_wing
## <chr> <dbl>
## 1 AMRO 0.583
## 2 AMRO 0.544
## 3 AMRO 0.686
## 4 AMRO 0.634
## 5 AMRO 0.596
## 6 AMRO 0.475
## 7 AMRO 0.662
## 8 AMRO 0.608
## 9 AMRO 0.546
## 10 AMRO 0.602
## # … with 5,599 more rows
We can also rename the column in the same step:
measures_subset %>%
transmute(
species = spp,
mass_wing = mass/wing)
## # A tibble: 5,609 × 2
## species mass_wing
## <chr> <dbl>
## 1 AMRO 0.583
## 2 AMRO 0.544
## 3 AMRO 0.686
## 4 AMRO 0.634
## 5 AMRO 0.596
## 6 AMRO 0.475
## 7 AMRO 0.662
## 8 AMRO 0.608
## 9 AMRO 0.546
## 10 AMRO 0.602
## # … with 5,599 more rows
The
transmute() function is pretty much just a combination of
mutate() and select(). The decision of when to
use transmute() vs. mutate() should be
determined by code parsimony and the ability of your code to communicate
the steps taken during data processing.
Similar to other functions, mutate() functions can be
run on grouped data.
For example, let’s determine whether a given row represents the first observation for a given species:
measures_subset %>%
mutate(id = row_number()) %>%
group_by(spp) %>%
mutate(
first_obs = id == min(id)) %>%
ungroup()
## # A tibble: 5,609 × 8
## band_number spp sex wing tl mass id first_obs
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <lgl>
## 1 0742-29420 AMRO M 126 NA 73.4 1 TRUE
## 2 0742-29421 AMRO M 129 NA 70.2 2 FALSE
## 3 0742-29422 AMRO F 119 NA 81.6 3 FALSE
## 4 0742-29423 AMRO F 122 NA 77.4 4 FALSE
## 5 0742-29426 AMRO U 108 NA 64.4 5 FALSE
## 6 0742-29427 AMRO U 103 NA 48.9 6 FALSE
## 7 0742-29428 AMRO F 115 NA 76.1 7 FALSE
## 8 0742-29431 AMRO F 119 NA 72.4 8 FALSE
## 9 0742-29432 AMRO M 130 NA 71 9 FALSE
## 10 0742-29433 AMRO M 122 NA 73.5 10 FALSE
## # … with 5,599 more rows
… we can use this information to subset the data to just the first observation in each group:
measures_subset %>%
mutate(
id = row_number()) %>%
group_by(spp) %>%
mutate(
first_obs = id == min(id)) %>%
ungroup() %>%
filter(first_obs)
## # A tibble: 7 × 8
## band_number spp sex wing tl mass id first_obs
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <lgl>
## 1 0742-29420 AMRO M 126 NA 73.4 1 TRUE
## 2 0791-17014 NOCA M 91.5 96.5 41 49 TRUE
## 3 0791-17029 GRCA M 90.5 99 34.7 55 TRUE
## 4 0820-94571 CACH F 59.5 49 10.4 155 TRUE
## 5 1141-05437 CARW M 62 53 20.3 190 TRUE
## 6 1141-05444 SOSP M 66 68 20.9 194 TRUE
## 7 1821-63303 HOWR U 99 NA 13 919 TRUE
… though this can be useful in many circumstances, the second mutate statement wasn’t really necessary:
measures_subset %>%
mutate(
id = row_number()) %>%
group_by(spp) %>%
filter(
id == min(id))
## # A tibble: 7 × 7
## # Groups: spp [7]
## band_number spp sex wing tl mass id
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 0742-29420 AMRO M 126 NA 73.4 1
## 2 0791-17014 NOCA M 91.5 96.5 41 49
## 3 0791-17029 GRCA M 90.5 99 34.7 55
## 4 0820-94571 CACH F 59.5 49 10.4 155
## 5 1141-05437 CARW M 62 53 20.3 190
## 6 1141-05444 SOSP M 66 68 20.9 194
## 7 1821-63303 HOWR U 99 NA 13 919
… nor was the first!
measures_subset %>%
group_by(spp) %>%
filter(
row_number() ==
min(
row_number()))
## # A tibble: 7 × 6
## # Groups: spp [7]
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4
## 2 0791-17014 NOCA M 91.5 96.5 41
## 3 0791-17029 GRCA M 90.5 99 34.7
## 4 0820-94571 CACH F 59.5 49 10.4
## 5 1141-05437 CARW M 62 53 20.3
## 6 1141-05444 SOSP M 66 68 20.9
## 7 1821-63303 HOWR U 99 NA 13
… and even that filter was a bit of overkill!
measures_subset %>%
group_by(spp) %>%
slice_head()
## # A tibble: 7 × 6
## # Groups: spp [7]
## band_number spp sex wing tl mass
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4
## 2 0820-94571 CACH F 59.5 49 10.4
## 3 1141-05437 CARW M 62 53 20.3
## 4 0791-17029 GRCA M 90.5 99 34.7
## 5 1821-63303 HOWR U 99 NA 13
## 6 0791-17014 NOCA M 91.5 96.5 41
## 7 1141-05444 SOSP M 66 68 20.9
The summarize() function allows you to calculate summary
statistics for a grouped dataset.
Let’s run the below and think about what occurred:
measures_subset %>%
group_by(spp) %>%
summarize(
mean_wing = mean(wing, na.rm = TRUE))
## # A tibble: 7 × 2
## spp mean_wing
## <chr> <dbl>
## 1 AMRO 124.
## 2 CACH 60.6
## 3 CARW 58.8
## 4 GRCA 88.7
## 5 HOWR 49.8
## 6 NOCA 90.8
## 7 SOSP 65.2
The above provided the average wing length by species … but what is summarize doing?
We could have used mutate to calculate the mean wing length by species:
measures_subset %>%
group_by(spp) %>%
mutate(
mean_wing = mean(wing, na.rm = TRUE))
## # A tibble: 5,609 × 7
## # Groups: spp [7]
## band_number spp sex wing tl mass mean_wing
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4 124.
## 2 0742-29421 AMRO M 129 NA 70.2 124.
## 3 0742-29422 AMRO F 119 NA 81.6 124.
## 4 0742-29423 AMRO F 122 NA 77.4 124.
## 5 0742-29426 AMRO U 108 NA 64.4 124.
## 6 0742-29427 AMRO U 103 NA 48.9 124.
## 7 0742-29428 AMRO F 115 NA 76.1 124.
## 8 0742-29431 AMRO F 119 NA 72.4 124.
## 9 0742-29432 AMRO M 130 NA 71 124.
## 10 0742-29433 AMRO M 122 NA 73.5 124.
## # … with 5,599 more rows
… then ungrouped the data:
measures_subset %>%
group_by(spp) %>%
mutate(
mean_wing = mean(wing, na.rm = TRUE)) %>%
ungroup()
## # A tibble: 5,609 × 7
## band_number spp sex wing tl mass mean_wing
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0742-29420 AMRO M 126 NA 73.4 124.
## 2 0742-29421 AMRO M 129 NA 70.2 124.
## 3 0742-29422 AMRO F 119 NA 81.6 124.
## 4 0742-29423 AMRO F 122 NA 77.4 124.
## 5 0742-29426 AMRO U 108 NA 64.4 124.
## 6 0742-29427 AMRO U 103 NA 48.9 124.
## 7 0742-29428 AMRO F 115 NA 76.1 124.
## 8 0742-29431 AMRO F 119 NA 72.4 124.
## 9 0742-29432 AMRO M 130 NA 71 124.
## 10 0742-29433 AMRO M 122 NA 73.5 124.
## # … with 5,599 more rows
… then selected our columns of interest with
select():
measures_subset %>%
group_by(spp) %>%
mutate(
mean_wing = mean(wing, na.rm = TRUE)) %>%
ungroup() %>%
select(spp, mean_wing)
## # A tibble: 5,609 × 2
## spp mean_wing
## <chr> <dbl>
## 1 AMRO 124.
## 2 AMRO 124.
## 3 AMRO 124.
## 4 AMRO 124.
## 5 AMRO 124.
## 6 AMRO 124.
## 7 AMRO 124.
## 8 AMRO 124.
## 9 AMRO 124.
## 10 AMRO 124.
## # … with 5,599 more rows
… then subset the data to unique rows with
distinct():
measures_subset %>%
group_by(spp) %>%
mutate(
mean_wing = mean(wing, na.rm = TRUE)) %>%
ungroup() %>%
select(spp, mean_wing) %>%
distinct()
## # A tibble: 7 × 2
## spp mean_wing
## <chr> <dbl>
## 1 AMRO 124.
## 2 NOCA 90.8
## 3 GRCA 88.7
## 4 CACH 60.6
## 5 CARW 58.8
## 6 SOSP 65.2
## 7 HOWR 49.8
But that’s a lot of coding … summarize saves us loads of time and really enhances code parsimony!
I often use the summarize() function, in combination
with n() for calculating the number of observations per
group:
measures_subset %>%
group_by(spp) %>%
summarize(n = n())
## # A tibble: 7 × 2
## spp n
## <chr> <int>
## 1 AMRO 484
## 2 CACH 1006
## 3 CARW 610
## 4 GRCA 1157
## 5 HOWR 748
## 6 NOCA 1172
## 7 SOSP 432
What if we wanted to summarize the number of observations by multiple
groups though? We simply separate additional groups in
group_by() by a comma. In doing so, any function (e.g.,
split, filter, mutate) will run on each group separately.
In the code below, I summarize the number of observations by species and sex:
measures_subset %>%
group_by(spp, sex) %>%
summarize(n = n())
## `summarise()` has grouped output by 'spp'. You can override using the `.groups`
## argument.
## # A tibble: 28 × 3
## # Groups: spp [7]
## spp sex n
## <chr> <chr> <int>
## 1 AMRO F 201
## 2 AMRO M 199
## 3 AMRO U 69
## 4 AMRO <NA> 15
## 5 CACH F 228
## 6 CACH M 199
## 7 CACH U 532
## 8 CACH <NA> 47
## 9 CARW F 174
## 10 CARW M 151
## # … with 18 more rows
But notice the message we received … the data are still grouped. By
default, summarize() will drop the last group used. If we
summarize based on a single group, we don’t have to think about it. When
we summarize based on multiple groups, however, we need to specify
.groups = 'drop' to remove the grouping structure from the
data:
measures_subset %>%
group_by(spp, sex) %>%
summarize(n = n(),
.groups = 'drop')
## # A tibble: 28 × 3
## spp sex n
## <chr> <chr> <int>
## 1 AMRO F 201
## 2 AMRO M 199
## 3 AMRO U 69
## 4 AMRO <NA> 15
## 5 CACH F 228
## 6 CACH M 199
## 7 CACH U 532
## 8 CACH <NA> 47
## 9 CARW F 174
## 10 CARW M 151
## # … with 18 more rows
Now you!
Modify the above such that the NA values in sex are removed and the resultant object is a four column tibble with the names spp, F (female), M (male), and U (unknown).
measures_subset %>% filter(!is.na(sex)) %>% group_by(spp, sex) %>% summarize(n = n(), .groups = 'drop') %>% pivot_wider( names_from = sex, values_from = n)## # A tibble: 7 × 4 ## spp F M U ## <chr> <int> <int> <int> ## 1 AMRO 201 199 69 ## 2 CACH 228 199 532 ## 3 CARW 174 151 246 ## 4 GRCA 333 590 185 ## 5 HOWR 199 424 104 ## 6 NOCA 395 505 215 ## 7 SOSP 76 288 47
Once data are tidy they can often be difficult to use unless we are able to join information across tables.
As I’ve stated throughout this course, a key is used to reference a set of values. When we store data in our global environment, the name of the data object is actually a key that R (and you) uses to retrieve the data. If you overwrite a key, R can no longer retrieve the data.
With tidy tabular data, two types of keys are of particular interest: primary and foreign keys. A primary key is a column that provides a unique identifier for each row of a dataset. A foreign key is a column that provides a reference to the primary key of another dataset. When we conduct a join, we are joining tables based on these two sets of keys.
Below, we see how tables in a relational database are interconnected. Each connection is defined by primary and foreign keys (from Wickham and Grolemund):
For this exercise, we’ll use the tibbles visits,
counts, and birds (all of which are in your
global environment).
Each row in the visits table represents the date that a
site was visited. It contains a primary key, visit_id.
Let’s take a look:
visits
## # A tibble: 1,394 × 3
## visit_id site_id date
## <chr> <chr> <chr>
## 1 49sd69T3908d480 4F825x0 2000-04-28
## 2 N4110Z739160380 P32Ka16 2000-05-08
## 3 9000637391b4980 68yQ999 2000-05-12
## 4 22mF4Si39233280 O030IH1 2000-05-18
## 5 771vBqK39233280 P32Ka16 2000-05-18
## 6 Sa2Y688392dbe80 K0w2408 2000-05-26
## 7 L6Nwh253935a780 T541U66 2000-06-01
## 8 948U0i23936f900 F6H60d3 2000-06-02
## 9 2SnXV013936f900 09Z5mxk 2000-06-02
## 10 K6240j539384a80 6F1w870 2000-06-03
## # … with 1,384 more rows
The counts table, which we’ve use previously, represents
the counts of birds, by species, during a visit. This table contains a
primary key, count_id, a foreign key to the visits table,
visit_id, and a foreign key to the birds
table, spp.
counts
## # A tibble: 2,640 × 4
## count_id visit_id spp count
## <chr> <chr> <chr> <dbl>
## 1 Nrpl13 3Fxh65U4fb43f80 AMRO 2
## 2 m9U585 3Fxh65U4fb43f80 GRCA 2
## 3 0Wlz34 3Fxh65U4fb43f80 MODO 1
## 4 WNf927 3Fxh65U4fb43f80 NOCA 2
## 5 606emZ 3Fxh65U4fb43f80 RBWO 1
## 6 47f20A 3Fxh65U4fb43f80 SOSP 1
## 7 p24140 3Fxh65U4fb43f80 WBNU 1
## 8 696vx5 8iXAa9851afd100 AMRO 1
## 9 WrF933 8iXAa9851afd100 CACH 1
## 10 tj6Q6m 8iXAa9851afd100 CARW 1
## # … with 2,630 more rows
The birds table contains natural history information for
each of the birds observed. The primary key of this table is
species and there are no foreign keys.
birds
## # A tibble: 38 × 4
## species common_name foraging diet
## <chr> <chr> <chr> <chr>
## 1 AMCR American Crow ground omnivore
## 2 AMGO American Goldfinch foliage granivore
## 3 AMRO American Robin ground omnivore
## 4 BAOR Baltimore Oriole foliage insectivore
## 5 BHCO Brown-headed Cowbird ground omnivore
## 6 BLJA Blue Jay ground omnivore
## 7 CACH Carolina Chickadee foliage insectivore
## 8 CARW Carolina Wren ground insectivore
## 9 CEDW Cedar Waxwing foliage omnivore
## 10 CHSP Chipping Sparrow ground omnivore
## # … with 28 more rows
Joins connect two tables, joining information based on matching key values.
Left joins The join that I use most often is a
left_join(). Optimally (but not always), a
left_join() is conducted by joining the table with the
foreign key to the table with the primary key, the
target table. In a left join, all records in the target table are
maintained and only matching records from the other table are
maintained.
Let’s join counts (primary key: count_id,
foreign key: spp) to birds (primary key:
birds). We specify the name of the target table, then the
name of the table we are joining, then the keys for the join with the
argument by =. Notice that the keys don’t share a name, so
we have to supply the key to the target and joining tables as a
vector.
birds %>%
left_join(
counts,
by = c('species' = 'spp'))
## # A tibble: 2,641 × 7
## species common_name foraging diet count_id visit_id count
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 AMCR American Crow ground omnivore k96300 150jN34577eed00 1
## 2 AMCR American Crow ground omnivore 673Cyd G686H0051edc780 2
## 3 AMCR American Crow ground omnivore C04qxA q1fkjW85b15d280 1
## 4 AMCR American Crow ground omnivore uD255E c3X17E65733c780 1
## 5 AMCR American Crow ground omnivore 56D6j8 7FoMJ055aef9700 2
## 6 AMCR American Crow ground omnivore Q20bmT 3r1759651b27400 1
## 7 AMCR American Crow ground omnivore abrJHZ ak357F35ed44500 2
## 8 AMCR American Crow ground omnivore 5U846a 9Y5P5465761ec00 1
## 9 AMCR American Crow ground omnivore 81z4W4 7q4F5J3573a5f00 2
## 10 AMCR American Crow ground omnivore B7F819 036560E5941ce00 1
## # … with 2,631 more rows
What if we were only interested in the common_name
variable in the birds table? We can use select to subset the table to
the variables of interest (optimally before the join):
birds %>%
select(species, common_name) %>%
left_join(
counts,
by = c('species' = 'spp'))
## # A tibble: 2,641 × 5
## species common_name count_id visit_id count
## <chr> <chr> <chr> <chr> <dbl>
## 1 AMCR American Crow k96300 150jN34577eed00 1
## 2 AMCR American Crow 673Cyd G686H0051edc780 2
## 3 AMCR American Crow C04qxA q1fkjW85b15d280 1
## 4 AMCR American Crow uD255E c3X17E65733c780 1
## 5 AMCR American Crow 56D6j8 7FoMJ055aef9700 2
## 6 AMCR American Crow Q20bmT 3r1759651b27400 1
## 7 AMCR American Crow abrJHZ ak357F35ed44500 2
## 8 AMCR American Crow 5U846a 9Y5P5465761ec00 1
## 9 AMCR American Crow 81z4W4 7q4F5J3573a5f00 2
## 10 AMCR American Crow B7F819 036560E5941ce00 1
## # … with 2,631 more rows
If we no longer have a need for the key column after the join, we can remove it with a select argument after the join:
birds %>%
select(species, common_name) %>%
left_join(
counts,
by = c('species' = 'spp')) %>%
select(!species)
## # A tibble: 2,641 × 4
## common_name count_id visit_id count
## <chr> <chr> <chr> <dbl>
## 1 American Crow k96300 150jN34577eed00 1
## 2 American Crow 673Cyd G686H0051edc780 2
## 3 American Crow C04qxA q1fkjW85b15d280 1
## 4 American Crow uD255E c3X17E65733c780 1
## 5 American Crow 56D6j8 7FoMJ055aef9700 2
## 6 American Crow Q20bmT 3r1759651b27400 1
## 7 American Crow abrJHZ ak357F35ed44500 2
## 8 American Crow 5U846a 9Y5P5465761ec00 1
## 9 American Crow 81z4W4 7q4F5J3573a5f00 2
## 10 American Crow B7F819 036560E5941ce00 1
## # … with 2,631 more rows
What if we don’t need the count_id field? Because of my
obsession with code parsimony, you might think that I’d remove it in the
last select statement. Instead, I remove it before joining the two data
sets.
birds %>%
select(species, common_name) %>%
left_join(
counts %>%
select(!count_id),
by = c('species' = 'spp')) %>%
select(!species)
## # A tibble: 2,641 × 3
## common_name visit_id count
## <chr> <chr> <dbl>
## 1 American Crow 150jN34577eed00 1
## 2 American Crow G686H0051edc780 2
## 3 American Crow q1fkjW85b15d280 1
## 4 American Crow c3X17E65733c780 1
## 5 American Crow 7FoMJ055aef9700 2
## 6 American Crow 3r1759651b27400 1
## 7 American Crow ak357F35ed44500 2
## 8 American Crow 9Y5P5465761ec00 1
## 9 American Crow 7q4F5J3573a5f00 2
## 10 American Crow 036560E5941ce00 1
## # … with 2,631 more rows
This order of operations allows you to step through the code and have a solid sense of the data being modified at each step.
Let’s now join our counts table (foreign key:
visit_id) to the visits table (primary key:
visit_id).
visits %>%
left_join(
counts,
by = 'visit_id')
## # A tibble: 3,718 × 6
## visit_id site_id date count_id spp count
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 49sd69T3908d480 4F825x0 2000-04-28 <NA> <NA> NA
## 2 N4110Z739160380 P32Ka16 2000-05-08 <NA> <NA> NA
## 3 9000637391b4980 68yQ999 2000-05-12 <NA> <NA> NA
## 4 22mF4Si39233280 O030IH1 2000-05-18 <NA> <NA> NA
## 5 771vBqK39233280 P32Ka16 2000-05-18 <NA> <NA> NA
## 6 Sa2Y688392dbe80 K0w2408 2000-05-26 <NA> <NA> NA
## 7 L6Nwh253935a780 T541U66 2000-06-01 <NA> <NA> NA
## 8 948U0i23936f900 F6H60d3 2000-06-02 <NA> <NA> NA
## 9 2SnXV013936f900 09Z5mxk 2000-06-02 <NA> <NA> NA
## 10 K6240j539384a80 6F1w870 2000-06-03 <NA> <NA> NA
## # … with 3,708 more rows
This didn’t work as well as I’d hoped. The left_join
included visits where no point counts occurred – and there were
apparently a lot of them.
Inner joins: We can join only matching keys using an
inner_join(). Only key values that are present in the
target and the joining tables are maintained:
visits %>%
inner_join(
counts,
by = 'visit_id')
## # A tibble: 2,640 × 6
## visit_id site_id date count_id spp count
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 43929684fa31c00 061Ud74 2012-05-04 iqo47Z AMRO 1
## 2 43929684fa31c00 061Ud74 2012-05-04 1Nit1v BHCO 1
## 3 43929684fa31c00 061Ud74 2012-05-04 169DZQ COGR 3
## 4 43929684fa31c00 061Ud74 2012-05-04 2l7d7X GRCA 1
## 5 43929684fa31c00 061Ud74 2012-05-04 r0B33O HAWO 1
## 6 43929684fa31c00 061Ud74 2012-05-04 974f13 HOWR 1
## 7 43929684fa31c00 061Ud74 2012-05-04 1p307m NOCA 2
## 8 43929684fa31c00 061Ud74 2012-05-04 w032J0 RBWO 2
## 9 43929684fa31c00 061Ud74 2012-05-04 nI4g86 SOSP 2
## 10 7vAF6bg4fab0500 3M87363 2012-05-10 985k62 AMRO 2
## # … with 2,630 more rows
I use inner joins often but, if you care about the values that aren’t shared across tables, and aren’t careful in exploring your data, inner joins can sometimes lead to data loss.
Full joins: A full join will maintain all records
from both tables, regardless of whether the keys match. Non-matching
values are coded as NA.
To illustrate this, I’ll join counts from a single visit
to the birds table (and select just a few of the columns to
make the table more legible):
birds %>%
select(species, common_name) %>%
full_join(
counts %>%
filter(visit_id == '2h08H4660c93f00'),
by = c('species' = 'spp'))
## # A tibble: 38 × 5
## species common_name count_id visit_id count
## <chr> <chr> <chr> <chr> <dbl>
## 1 AMCR American Crow <NA> <NA> NA
## 2 AMGO American Goldfinch <NA> <NA> NA
## 3 AMRO American Robin <NA> <NA> NA
## 4 BAOR Baltimore Oriole <NA> <NA> NA
## 5 BHCO Brown-headed Cowbird <NA> <NA> NA
## 6 BLJA Blue Jay o41359 2h08H4660c93f00 2
## 7 CACH Carolina Chickadee <NA> <NA> NA
## 8 CARW Carolina Wren 4Z17jG 2h08H4660c93f00 1
## 9 CEDW Cedar Waxwing <NA> <NA> NA
## 10 CHSP Chipping Sparrow <NA> <NA> NA
## # … with 28 more rows
This can be very useful when we’re looking to maintain those
NA values!
We use joins to filter data based on matching (or non-matching) key-values. These are called filtering joins. In a filtering join, no columns are added to the target data frame.
Semi-joins: Semi-joins subset data in the target table by the key values that are present in the joining table.
Let’s filter the count data to counts of species
observed with a insectivorous diet
(diet == insectivore).
counts %>%
semi_join(
birds %>%
filter(diet == 'insectivore'),
by = c('spp' = 'species'))
## # A tibble: 938 × 4
## count_id visit_id spp count
## <chr> <chr> <chr> <dbl>
## 1 p24140 3Fxh65U4fb43f80 WBNU 1
## 2 WrF933 8iXAa9851afd100 CACH 1
## 3 tj6Q6m 8iXAa9851afd100 CARW 1
## 4 M8r390 8iXAa9851afd100 EAPH 1
## 5 e3K50d 8iXAa9851afd100 TUTI 1
## 6 3Q3244 150jN34577eed00 CARW 2
## 7 V12I4K 150jN34577eed00 DOWO 1
## 8 hG9X4o 150jN34577eed00 TUTI 4
## 9 t528Wu 0n075TL4ff23600 CARW 1
## 10 1071D1 0n075TL4ff23600 HOWR 1
## # … with 928 more rows
Anti-joins: Semi-joins subset data in the target table by the key values that are not present in the joining table.
For example, we can subset the count data to counts of
species observed who do not have an insectivorous diet:
counts %>%
anti_join(
birds %>%
filter(diet == 'insectivore'),
by = c('spp' = 'species'))
## # A tibble: 1,702 × 4
## count_id visit_id spp count
## <chr> <chr> <chr> <dbl>
## 1 Nrpl13 3Fxh65U4fb43f80 AMRO 2
## 2 m9U585 3Fxh65U4fb43f80 GRCA 2
## 3 0Wlz34 3Fxh65U4fb43f80 MODO 1
## 4 WNf927 3Fxh65U4fb43f80 NOCA 2
## 5 606emZ 3Fxh65U4fb43f80 RBWO 1
## 6 47f20A 3Fxh65U4fb43f80 SOSP 1
## 7 696vx5 8iXAa9851afd100 AMRO 1
## 8 B4o8a2 8iXAa9851afd100 GRCA 1
## 9 P58660 8iXAa9851afd100 MODO 1
## 10 0411N0 8iXAa9851afd100 NOCA 4
## # … with 1,692 more rows
To better understand joins, I strongly recommend reading the R for Data Science chapter (link) on joining data. It’s one of the most important skills for working with data!